Read and format project data
# Include and execute your code here
df = pd.read_csv("https://github.com/byuidatascience/data4names/raw/master/data-raw/names_year/names_year.csv")Course DS 250
Josh Schaefer
Looking at all of the charts and tables that I have provided I can come to lots of interesting conclusion. The first thing I learned from these graphs was when I was born my name “Joshua” was more likely to be used during that year. What I learned from my second graph was that if someone was named “Brittany” they are most likely to be around 33 and very unlikely to be older than 45 and younger than 20. For my third graph I used a set of christian names and saw that most of them peaked around mid 1950s. After they started slowly declining. For my final graph I saw that if a movie character has a unique name it can influence its use dramatically. For example the use of the name “Forrest” had a 443% increase from 1984 to 1994. These are the main finding of my graphs and tables.
Highlight the Questions and Tasks
How does your name at your birth year compare to its use historically?
Looking at the line chart my name used at my current birth year(2004) is not that much out of the ordinary. In the year 2004 the name “Joshua” is used a lot, but not as much as about a decade before. Looking at the table that I made at the bottom shows the amount of people named “Joshua” in 2004 which is 20838 while the average for all of the years provided in the dataset is 17624. Historically I am more likely to be born in 2004 because it is higher than the average.
# Include and execute your code here
joshua_data = df.query("name == 'Joshua'")
total_joshuas_2004 = joshua_data.query("year == 2004")['Total'].sum()
fig = px.line(joshua_data, x='year', y='Total', title='Occurrences of the name "Joshua" over the years')
fig.add_annotation(
dict(
x=2004,
y=total_joshuas_2004.max(),
text='Year I was born(2004)',
showarrow=True,
arrowhead=2,
arrowcolor='red',
arrowwidth=2,
ax=20,
ay=-80,
)
)
fig.show()include figures in chunks and discuss your findings in the figure.
# Assuming joshua_data is your DataFrame
joshua_data = df.query("name == 'Joshua'")\
.groupby('year')\
.sum()\
.reset_index()\
.tail(15)\
.filter(["year", "Total"])
# Calculate the average for Joshua and round totals
average_value = round(joshua_data['Total'].mean())
joshua_data['Total'] = joshua_data['Total'].round()
# Create a DataFrame with two rows
result_df = pd.DataFrame({'year': [2004, 'Average'], 'Total': [joshua_data[joshua_data['year'] == 2004]['Total'].values[0], average_value]})
# Display the result_df
display(result_df)| year | Total | |
|---|---|---|
| 0 | 2004 | 20838.0 |
| 1 | Average | 17624.0 |
If you talked to someone named Brittany on the phone, what is your guess of his or her age? What ages would you not guess?
Looking at the bar chart that I created above I could get a pretty good idea of what year Brittany was most likely to be born in. My guess just from eyeballing it was between 31-37. I wanted to get the exact answer so I decided to make a program to do so. This gave me the age of 33 which means they were born in 1991. So if I was on the phone I would guess someone named Brittany would be around 33. I would not guess that they were younger than 20 and older than 45.
::: {#cell-Q2 chart .cell execution_count=6}
:::
Average year of Brittany: 33
Mary, Martha, Peter, and Paul are all Christian names. From 1920 - 2000, compare the name usage of each of the four names. What trends do you notice?
After creating both the line chart and the table I can see that at around 1952 was the peak of the use of all of these Christian names. After that it started slowly declining until around 1980 where you didn’t really see these names used as much. I find it super intresting why these names started not getting used anymore. I wonder if Christian names became less popular or the world just moved onto new names.
# Include and execute your code here
# Create a query to filter the data for the specified names
# Use query method to filter based on the values in the "name" column
names_data = df.query("name in ['Mary', 'Martha', 'Peter', 'Paul']")
mary_data = names_data.query("name == 'Mary'")
martha_data = names_data.query("name == 'Martha'")
peter_data = names_data.query("name == 'Peter'")
paul_data = names_data.query("name == 'Paul'")::: {#cell-Q3 chart .cell execution_count=9}
:::
::: {#cell-Q3 table .cell .tbl-cap-location-top tbl-cap=‘Not much of a table’ execution_count=10}
# Include and execute your code here
mary_peak_year = mary_data.loc[mary_data['Total'].idxmax(), 'year']
martha_peak_year = martha_data.loc[martha_data['Total'].idxmax(), 'year']
peter_peak_year = peter_data.loc[peter_data['Total'].idxmax(), 'year']
paul_peak_year = paul_data.loc[paul_data['Total'].idxmax(), 'year']
peak_years_df = pd.DataFrame({
'Name': ['Mary', 'Martha', 'Peter', 'Paul', 'Average'],
'Total': [mary_data['Total'].max(), martha_data['Total'].max(), peter_data['Total'].max(), paul_data['Total'].max(), None],
'Year': [mary_peak_year, martha_peak_year, peter_peak_year, paul_peak_year, None]
})
peak_years_df['Total'] = peak_years_df['Total'].round()
peak_years_df['Year'] = peak_years_df['Year'].round()
average_peak = peak_years_df['Year'][:-1].mean()
average_peak = math.ceil(average_peak)
peak_years_df.loc[4, 'Year'] = average_peak
display(peak_years_df)| Name | Total | Year | |
|---|---|---|---|
| 0 | Mary | 53791.0 | 1950.0 |
| 1 | Martha | 10651.0 | 1947.0 |
| 2 | Peter | 11321.0 | 1956.0 |
| 3 | Paul | 25662.0 | 1954.0 |
| 4 | Average | NaN | 1952.0 |
:::
Think of a unique name from a famous movie. Plot the usage of that name and see how changes line up with the movie release. Does it look like the movie had an effect on usage?
Looking at the line graph that I have provided you can see that the name Forrest started off with some pretty high usage between 1915-1930. Through the next couple decades it seems to have a slow decline until the release of the film “Forrest Gump” in 1994. You can see the spike in the graph and how much this movie effected the usage of the name. I decided to see the percent increase in the name Forrest from 1984-1994. It turned out to be around 443%. Knowing this I can come to the conclusion that the film “Forrest Gump” greatly effected the usage of the name.
::: {#cell-Q4 table .cell .tbl-cap-location-top tbl-cap=‘Not much of a table’ execution_count=11}
# Include and execute your code here
forrest_data = df.query("name == 'Forrest'")
fig = px.line(forrest_data, x='year', y='Total', title='Occurrences of the name "Forrest" over the years')
fig.add_annotation(
dict(
x=1994,
y=forrest_data['Total'].max(),
text='Forrest Gump Release(1994)',
showarrow=True,
arrowhead=2,
arrowcolor='red',
arrowwidth=2,
ax=20,
ay=-80,
)
)
fig.show():::